ABSTRACT

We describe in this paper a new method for building an 
efficient algorithm for scheduling jobs in a cluster. Jobs are 
considered as parallel tasks (PT) which can be scheduled on 
any number of processors. The main feature is to consider 
two criteria that are optimized together. These criteria are 
the makespan and the weighted minimal average completion 
time (minsum). They are chosen for their complementarity, 
to be able to represent both user-oriented objectives and 
system administrator objectives. 

We propose an algorithm based on a batch policy with
increasing batch sizes, with a smart selection of jobs in each
batch. This algorithm is assessed through extensive simulations
and compared with a new lower bound (obtained by relaxing an
ILP) on the optimal schedules for both criteria separately. It
is currently being implemented on a full-scale cluster platform.

Categories and Subject Descriptors 

F.2.2 [Analysis of Algorithms and Problem Complex- 
ity]: Nonnumerical Algorithms and Problems — Sequencing 
and scheduling; D.4.1 [Operating Systems]: Process man- 
agement — Scheduling, Concurrency 

General Terms 

Algorithms, Management 

1. INTRODUCTION
1.1 Cluster computing 

The last few years have been characterized by huge
technological changes in the area of parallel and distributed
computing. Today, powerful machines are available at low
price everywhere in the world. The most visible of these
changes is the wide spread of clusters, which consist of a
collection of tens or hundreds of standard, almost identical
processors connected by a high-speed interconnection network.
The next natural step is the extension to local sets of
clusters or to geographically distant grids [10].

In the last issue of the Top500 ranking (from November
2003), 52 networks of workstations (NOW) of different
kinds were listed and 123 entries were clusters sold by
IBM, HP or Dell. Looking at previous rankings, we can see
that this number (within the Top500) approximately
doubled each year.

This democratization of clusters calls for new practical
administration tools. Even if more and more applications
are running on such systems, there is no consensus on
a universal way of managing the computing resources
efficiently. Currently available scheduling algorithms were
mainly created to provide schedules with performance
guarantees for the makespan criterion (the completion time
of the last job). However, most of them are pseudo-polynomial,
so the time needed to run these algorithms on real instances
and the difficulty of implementing them hinder a wider use.

We present in this paper a new method for scheduling
the jobs submitted to a cluster, inspired by several existing
theoretically well-founded algorithms. This method has
been assessed by simulation and is currently being tested
under real conditions of use on a large cluster composed of 104
bi-processor machines from Compaq (this cluster, called
Icluster2, was ranked 151 in the Top500 in June 2003).



To achieve reasonable performance within reasonable time,
we decided to build a fast algorithm which has the best
features of existing ones. However, this speed comes at the
price of a guaranteed performance ratio, so we concentrate
on the average ratio over a large set of generated
test instances. These instances are representative of jobs
submitted on the Icluster [18].

1.2 Related approaches 

Some scheduling algorithms have been developed for
classical parallel and distributed systems of the last
generations. Clusters introduce new characteristics that are not
really taken into account by existing scheduling modules,
namely the imbalance between communications and
computations (communications are relatively large) and the
on-line submission of jobs.

Let us briefly present some schedulers used in actual
systems. The basic idea in job schedulers [13] is to queue jobs
and to schedule them one after the other using some simple
rules like FCFS (First Come First Served) with priorities.
The MAUI scheduler [11] extends the model with additional
features like fairness and backfilling.

AppLeS is an application-level scheduler system for grids. It
is used to schedule, for example, an application composed of
a large set of independent jobs with shared data input files.
It selects resources efficiently and takes into account
data distribution time.

There exist other parallel environments with a more
general spectrum (heterogeneous and versatile execution
platforms) like Condor, or with special capabilities like
process migration, requiring system-level implementation, like
Mosix [3]. However, in these environments scheduling
algorithms are on-line algorithms with simple rules.

1.3 Our approach 

As no fast and flexible scheduling system is available
today for clusters, we started two years ago to develop a
new system based on a sound theoretical background and
significant practical experience of managing a big cluster
(Icluster1, a 225-PC machine that arrived in our lab in 2001). It
is based on the model of parallel tasks [9], which are
independent jobs submitted by the users.

We are interested here in simultaneously optimizing two
criteria, namely the minsum ($\sum C_i$), which is usually targeted
by the users, who all want to finish their jobs as soon as
possible, and the makespan ($C_{\max}$), which is rather a system
administrator objective representing the total occupation time
of the platform.

There exist algorithms for each criterion separately; we
propose here a bi-criteria algorithm to optimize the $C_{\max}$
and $\sum C_i$ criteria simultaneously. The best existing
algorithm for minimizing the makespan off-line (all jobs are
available at the beginning) has a $3/2 + \epsilon$ guarantee [7]. We
can easily derive an on-line batch version by using the
general framework of [21], leading to an approximation ratio of
$3 + \epsilon$. For the other criterion, the best result is 8 for the
unweighted case and 8.53 for the weighted case [19].
Using a nice generic framework introduced by Hall et al. [12], a
(12; 12)-approximation can be obtained, at the cost of a high
complexity which impedes the use of such algorithms.

The paper is organized as follows. In the next section,
we introduce the definitions and models used throughout the
paper. The algorithm itself is described in Section 3, along
with the lower bound which is used in the experiments. The
experimental setting and the results are discussed in Section
4. Finally, we conclude in Section 5 with a discussion of
on-going work.

2. CONTEXT AND DEFINITION 

2.1 Architectural and Computing Models 

The target execution support that we consider here is a
cluster composed of a medium number of
SMP or simple PC machines (typically several dozens or
several hundreds of nodes). The nodes are fully connected
and homogeneous.



Figure 1: Job submission in clusters.

Jobs are submitted through some specific nodes by
way of several priority queues, as depicted in Figure 1.
No other means of submission is allowed.

Informally, a Parallel Task (PT) is a task that gathers
elementary operations, typically a numerical routine or a
nested loop, and which itself contains enough parallelism to be
executed by more than one processor. We studied the scheduling
of one specific kind of PT, denoted as moldable jobs
according to the classification of Feitelson et al. The number of
processors used to execute a moldable job is not fixed but
determined before the execution, as opposed to rigid jobs, where
the number of processors is fixed by the user at submission
time. In any case, the number of processors does not change
until the completion of the job.

For historical reasons, most submitted jobs are rigid.
However, intrinsically, most parallel applications are mold- 
able. An application developer does not know in advance the 
exact number of processors which will be used at run time. 
Moreover, this number may vary with the input problem size 
or number of available nodes. This is also true for many nu- 
merical parallel libraries. The main exception to this rule is 
when a minimum number of processors is required because 
of time, memory or storage constraints. 

The main obstacle to a systematic use of the moldable
character is the need for a practical and reliable way to
estimate (at least roughly) the parallel execution time as a
function of the number of processors. Most of the time, the user
has this knowledge but does not provide it to the scheduler,
as it is not taken into account by rigid-job schedulers. This
is an inertia factor against the more systematic use of such
models, as the users' habits have to be changed.



Thanks to moldability, our algorithm aims to efficiently
decrease the average response time (at the users' request)
while keeping computing overhead and idle time as low as
possible (at the system administrators' request).

2.2 Scheduling on clusters 

The main objective function used historically is the makespan. 
This function measures the ending time of the schedule, i.e., 
the latest completion time over all the tasks. However, this 
criterion is valid only if we consider the tasks altogether and 
from the viewpoint of a single user. If the tasks have been 
submitted by several users, other criteria can be considered. 
Let us present briefly the two criteria: 

• Minimization of the makespan ($C_{\max} = \max_j C_j$, where
the completion time $C_j$ is equal to $\sigma(j) + p_j(nbproc(j))$).
Here $p_j$ represents the execution time of task $j$, the function
$\sigma$ gives its starting time and the function $nbproc$ gives its
number of processors (it can be a vector in the case of specific
allocations for heterogeneous processors).

• Minimization of the average completion time ($\sum C_i$)
[20] and its variant, the weighted completion time ($\sum w_i C_i$).
Such a weight may allow us to distinguish some tasks
from each other (priority for the smallest ones, etc.).
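
As a minimal illustration of these two criteria, the following sketch evaluates them for a given schedule. The function names and the two example tasks are hypothetical; only the definitions $C_j = \sigma(j) + p_j(nbproc(j))$, $C_{\max}$ and $\sum w_i C_i$ come from the text.

```python
def completion_times(start, nbproc, p):
    """C_j = sigma(j) + p_j(nbproc(j)) for every task j.

    start[j]  : starting time sigma(j)
    nbproc[j] : number of processors allotted to task j
    p[j]      : function giving the execution time of task j on k processors
    """
    return [start[j] + p[j](nbproc[j]) for j in range(len(start))]


def makespan(completions):
    """C_max: the latest completion time over all tasks."""
    return max(completions)


def weighted_minsum(completions, weights):
    """Sum of w_i * C_i over all tasks."""
    return sum(w * c for w, c in zip(weights, completions))


# Two hypothetical tasks whose work (4 and 6) is perfectly divisible.
p = [lambda k: 4 / k, lambda k: 6 / k]
C = completion_times(start=[0.0, 0.0], nbproc=[2, 3], p=p)
c_max = makespan(C)
minsum = weighted_minsum(C, weights=[1, 2])
```

Both tasks finish at time 2 here, so the makespan is 2 while the weighted minsum depends on the weights given to each task.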

In a production cluster context, the jobs are submitted
at any time. Models where the characteristics of the tasks
(duration, release date, etc.) are only known when the task
is submitted are called on-line, as opposed to the off-line
models where all the tasks are known and available at all
times. It is possible to schedule jobs on-line with a constant
competitive ratio for $C_{\max}$. The idea is to schedule jobs by
batches depending on their arrival time. An arriving job
is scheduled in the next starting batch. This simple rule
yields a constant competitive ratio in the on-line case if a
single batch can be scheduled with a constant competitive
ratio $\rho$.

Roughly, the last batch starts after the arrival date of the
last task. By definition, the tasks of any batch are
scheduled in less than $\rho C^*_{\max}$, where $C^*_{\max}$ is the optimal
off-line makespan of the complete instance. The length of the
next-to-last batch is therefore lower than $\rho C^*_{\max}$. Moreover,
the length of the last batch, plus the starting time of the
next-to-last batch (at which none of the tasks of the last
batch had been released) is less than $\rho$ times the
optimal on-line makespan.

As the on-line makespan is larger than the off-line makespan,
the total schedule length is less than $2\rho$ times the on-line
optimal makespan. This is how the off-line $3/2 + \epsilon$ algorithm
is turned into an on-line $3 + \epsilon$ algorithm, as mentioned in the
introduction.
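
The batch rule can be sketched as a thin wrapper around any off-line algorithm. In this hedged example, `online_batches` and `offline_makespan` are hypothetical names, jobs are reduced to single-machine durations, and Python's built-in `sum` stands in for an off-line scheduler; it is an illustration of the rule, not the implementation used on the cluster.

```python
def online_batches(jobs, offline_makespan):
    """Schedule jobs by successive batches and return the final completion time.

    jobs             : list of (release_date, job) pairs
    offline_makespan : function mapping a list of jobs to the length of the
                       off-line schedule built for them (assumed given)
    """
    pending = sorted(jobs, key=lambda rj: rj[0])
    now, i = 0.0, 0
    while i < len(pending):
        # Stay idle until at least one unscheduled job has been released.
        now = max(now, pending[i][0])
        batch = []
        # Every job released so far goes into the batch that starts now.
        while i < len(pending) and pending[i][0] <= now:
            batch.append(pending[i][1])
            i += 1
        # The next batch can only start when this one is finished.
        now += offline_makespan(batch)
    return now


# Hypothetical instance: jobs are durations on one machine.
jobs = [(0, 3), (1, 2), (2, 2), (10, 1)]
end = online_batches(jobs, offline_makespan=sum)
```

In this instance the jobs released at times 1 and 2 wait for the first batch to finish at time 3 and are then scheduled together.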

3. A NEW BICRITERIA EFFICIENT SOLUTION

3.1 Rationale 

Studying some extreme instances and their optimal
schedules for the minsum criterion gave us an insight into the shape
of the schedules we had to build. For example, if all the tasks
are perfectly moldable (when the work does not depend on
the number of processors), the optimal solution is to
schedule all the tasks on all processors in order of increasing area.
This example shows that the minsum criterion tends to give
more importance to the smaller tasks.
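
The following sketch illustrates this extreme case on a small hypothetical instance of perfectly moldable, unweighted tasks. It only compares the possible orders when every task uses all $m$ processors; it does not enumerate other kinds of schedules.

```python
from itertools import permutations


def minsum_all_processors(works, m):
    """Sum of completion times when the tasks run one after the other,
    each on all m processors, in the given order (work = area)."""
    t, total = 0.0, 0.0
    for work in works:
        t += work / m  # perfectly moldable: duration = work / m
        total += t
    return total


works = [8.0, 2.0, 4.0]  # hypothetical areas
m = 2
by_area = minsum_all_processors(sorted(works), m)
best = min(minsum_all_processors(order, m) for order in permutations(works))
worst = max(minsum_all_processors(order, m) for order in permutations(works))
```

On this instance the increasing-area order reaches the best value among all orders, while the decreasing-area order gives the worst one.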

Previous algorithms presented in the literature are also
designed to take into account this global structure of
scheduling the smaller tasks first. Shmoys et al. [12] used a batch
scheduling with batches of increasing sizes. The batch length
is doubled at each step, therefore only the smaller tasks are
scheduled in the first batches.

Existing makespan algorithms for moldable tasks are also
designed with a common structure of shelves (where all tasks
start at the same time), which is a relaxed version of batches.
See for example [17] or [7] for schedules with 2 shelves.

Our algorithm was built with this structure in mind:
stacking tasks in shelves of increasing sizes, with the additional
possibility of shuffling these shelves if necessary. However,
our main motivation was to design a fast algorithm for the
management of some clusters of a big regional grid in
Grenoble. Our algorithm does not have a known performance
guarantee in the worst case; however, we tested its behavior
on a set of generated instances which simulate real jobs
submitted on our local clusters. The principle of the algorithm
is shown in Figure 2.




Figure 2: Principle of the algorithm. 

3.2 Algorithm 

More formally, we detail below the algorithm starting with 
the input describing the instances: 

• $n$ tasks available at time 0

• $p_i(k)$ the processing time of task $i$ on $k$ processors

• $w_i$ the weight of task $i$

• $m$ the number of processors



Compute the approximate $C_{\max}$ with the dual approximation
algorithm.
$t_{\min} = \min_{i,j} \{p_i(j)\}$
$K = \lfloor \log_2 (C_{\max} / t_{\min}) \rfloor$
for $j = 0..K+1$ do
    $t_j = 2^j t_0$ ($t_0$ being the smallest useful batch size)
end for
$T = \{1..n\}$
for $j = 0..K$ do
    $S = \{i \in T$ such that $\exists k,\ p_i(k) \le t_j\}$
    Merge the small sequential tasks sorted by decreasing
    weight.
    Select the set $S_j \subseteq S$ of tasks to schedule in the
    current batch (using a knapsack).
    Schedule the batch between $t_j$ and $t_{j+1}$.
    Remove $S_j$ from $T$.
end for
Compact the schedule with a list algorithm using the
batch ordering.



First, our algorithm calls a dual approximation makespan
algorithm (defined in [7]) to determine an approximation
of the optimal makespan of the instance. With this value
$C_{\max}$ and the smallest possible duration of a task $t_{\min}$, we
compute the smallest useful batch size $t_0$ (such that at least
one task can be done) and $K + 1$, the number of batches.
The values $t_j$ are the lengths of our batches. For every $j$,
$t_{j+1}$ is twice the value of $t_j$.
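
A minimal sketch of this computation is given below. The text only requires that the first batch be long enough for at least one task and that lengths double; taking $t_0 = t_{\min}$ by default, and deriving $K$ from the ratio of the approximate makespan to $t_{\min}$, are assumptions of this example, as are the function and parameter names.

```python
import math


def batch_lengths(c_max, t_min, t0=None):
    """Return (K, [t_0, ..., t_{K+1}]) with t_{j+1} = 2 * t_j.

    c_max : approximate makespan given by the dual approximation
    t_min : smallest possible duration of a task
    t0    : smallest useful batch size (assumed equal to t_min if omitted)
    """
    K = math.floor(math.log2(c_max / t_min))
    if t0 is None:
        t0 = t_min  # assumption: the shortest task exactly fits batch 0
    return K, [t0 * 2 ** j for j in range(K + 2)]


K, t = batch_lengths(c_max=20.0, t_min=1.5)
```

With these hypothetical values there are $K + 1 = 4$ batches, and the last boundary $t_{K+1}$ exceeds the approximate makespan.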

The main loop of the algorithm corresponds to the selec- 
tion of the jobs to be scheduled in the current batch. We 
first select the tasks which are not too long to run in the 
batch. If there are several tasks that can be run in less than 
half the batch size on one processor, we can merge some of 
these tasks by stacking them together. In order to have as 
much weight as possible, this merge is done by decreasing 
weight order. 

The next step is to run a knapsack selection, written with
integer dynamic programming. We want to maximize the
sum of the weights of the selected tasks while using at most $m$
processors. The allocation of task $i$ is $allot_i$, the smallest
allocation that fits (in length) into the batch. Values of
$W(i, j)$ are initialized to $-\infty$ for $j < 0$ and $0$ otherwise. For
$i$ going from 1 to $n$ and for $j$ going from 1 to $m$, we compute:

$W(i, j) = \max\left(W(i-1, j),\ W(i-1, j - allot_i) + w_i\right)$

The largest $W(n, \cdot)$ is the maximum weight that can be done
in the batch. The complexity of this knapsack is $O(mn)$.
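
The recurrence can be sketched as follows. The function name, the backtracking step used to recover the selected set and the example values are additions made for illustration; the table follows the initialization described in the text ($-\infty$ for negative processor counts, 0 otherwise).

```python
def knapsack(weights, allot, m):
    """Select tasks maximizing total weight while using at most m processors.

    weights[i] : weight w_i of candidate task i
    allot[i]   : processors needed by task i to fit in the batch
    Returns (best_weight, sorted list of selected task indices).
    """
    n = len(weights)
    neg_inf = float("-inf")
    # W[i][j]: best weight with the first i tasks on at most j processors.
    W = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, a = weights[i - 1], allot[i - 1]
        for j in range(m + 1):
            # A negative processor count is infeasible (value -infinity).
            take = W[i - 1][j - a] + w if j >= a else neg_inf
            W[i][j] = max(W[i - 1][j], take)
    # Backtrack to recover one optimal selection.
    selected, j = [], m
    for i in range(n, 0, -1):
        if W[i][j] != W[i - 1][j]:
            selected.append(i - 1)
            j -= allot[i - 1]
    return W[n][m], sorted(selected)


best, chosen = knapsack(weights=[6, 5, 4], allot=[3, 2, 2], m=4)
```

Here the two lighter tasks together (weight 9 on 4 processors) beat the single heaviest task (weight 6 on 3 processors). The two nested loops give the $O(mn)$ cost stated above.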

The first schedule is simple: we start all the selected tasks
of one batch at the same time. A straightforward
improvement is to start a task at an earlier time if all the processors
it uses are idle. A further improvement is to use a list
algorithm with the batch ordering and a local ordering within
the batches, as it allows us to change the set of processors
allotted to the tasks.

Finally, an additional optimization step is used. The 
batch order is shuffled several times and the best resulting 
compact schedule is kept. This only leads to small improve- 
ments. 

The overall complexity of this algorithm is $O(mnK)$.

3.3 Lower Bound 

In order to assess this algorithm with experiments, for 
each instance we need to know the value of an optimal solu- 
tion. But since the problem is NP-Hard in the strong sense, 
computing an optimal solution in reasonable time is impos- 
sible. We are thus looking for good lower bounds. 

For $C_{\max}$, a good lower bound may easily be obtained by
dual approximation. For $\sum C_i$, the lower bound is
computed by a relaxation of a Linear Programming formulation
of the problem. This formulation is not intended to yield a
feasible schedule, but rather to express constraints that are
necessarily respected by every feasible schedule. For this
formulation, we divide the time horizon into several
intervals $I_j = (t_j, t_{j+1}]$ with $0 \le j \le K$. The values of the $t_j$ and
the value of $K$ are defined as in the previous section.

Once the time division is fixed, we consider the decision
variables $x_{i,j} = 1$ if and only if task $i$ ends within $I_j$ (i.e.
between $t_j$ and $t_{j+1}$), and $x_{i,j} = 0$ otherwise.

For each task $i$ and each interval $j$, we can also compute
the minimal area occupied by task $i$ if it ends before $t_{j+1}$:

$S_{i,j} = \min_{1 \le k \le m} \{k\, p_i(k) \text{ such that } p_i(k) \le t_{j+1}\}$

If the set is empty, let $S_{i,j} = +\infty$.
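
A direct transcription of this definition is sketched below, assuming the processing times of a task are stored as a list indexed by the number of processors (a hypothetical representation chosen for the example).

```python
import math


def min_area(p_i, t_next):
    """S_{i,j}: smallest area k * p_i(k) over allotments with p_i(k) <= t_{j+1}.

    p_i    : list [p_i(1), ..., p_i(m)] of processing times
    t_next : end t_{j+1} of the interval considered
    Returns +infinity when no allotment finishes in time.
    """
    areas = [k * p for k, p in enumerate(p_i, start=1) if p <= t_next]
    return min(areas, default=math.inf)


p_i = [10.0, 6.0, 4.5, 4.0]  # hypothetical task on m = 4 processors
area_tight = min_area(p_i, t_next=5.0)   # only k = 3 and k = 4 fit
area_loose = min_area(p_i, t_next=10.0)  # the sequential run fits
area_none = min_area(p_i, t_next=3.0)    # nothing fits
```

The tighter the deadline, the more processors the task needs and the larger its minimal area.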



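With these values, we can give the formulation of the
problem:

Minimize $\sum_{i,j} w_i t_j x_{i,j}$
Subject to $\forall i,\ \sum_j x_{i,j} \ge 1$
$\forall j,\ \sum_i \sum_{0 \le l \le j} S_{i,l}\, x_{i,l} \le m\, t_{j+1}$
$\forall i, \forall j,\ x_{i,j} \in \{0, 1\}$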

The first constraint expresses that every task should be
performed at least once. The minimization criterion implies
that no task will be performed more than once: if $x_{i,j}$ and
$x_{i,j'}$ are both equal to one, we get a better, yet still feasible
solution by setting one of them to zero.

The second constraint is a surface argument. For each
interval $I_j$, we consider the tasks that end before or in this
interval (they end in $I_l$, for $l \le j$). By definition, a task
$i$ ending in interval $I_l$ takes up a surface of at least $S_{i,l}$. The
sum of all these surfaces has to be smaller than the total
surface between time 0 and time $t_{j+1}$, which is $m\, t_{j+1}$. This
is obviously optimistic, because it does not take into
account collisions between tasks: scheduling according to this
formulation might require more than $m$ processors.

Both of these constraints are satisfied by every feasible
schedule, so for every feasible schedule $S$, there is a solution
$R$ to this linear program. Since for each job $i$, $\sum_j t_j x_{i,j} \le C_i$,
the objective function of $R$ is not larger than the
minsum criterion of the schedule $S$. In particular, every optimal
schedule yields a solution to the linear program, so the
optimal value of the objective function is always smaller than
the optimal value of the minsum criterion of the
scheduling problem. This still holds when considering the relaxed
problem, where $x_{i,j}$ is in $[0, 1]$. The lower bound might be
weaker, but is much faster to compute.
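
The key step of this argument, that a schedule induces a 0/1 solution whose objective does not exceed its minsum, can be checked numerically. The sketch below assumes every completion time lies in $(t_0, t_{K+1}]$; the function name and the instance are hypothetical, and the surface constraint is not checked here.

```python
def induced_lp_value(completions, weights, t):
    """Objective of the 0/1 solution induced by a schedule.

    Task i gets x[i][j] = 1 for the interval I_j = (t[j], t[j+1]] that
    contains its completion time C_i, and contributes w_i * t[j].
    Assumes every C_i lies in (t[0], t[-1]].
    """
    value = 0.0
    for c, w in zip(completions, weights):
        j = next(j for j in range(len(t) - 1) if t[j] < c <= t[j + 1])
        value += w * t[j]
    return value


t = [1.0, 2.0, 4.0, 8.0]       # interval boundaries t_0..t_3
completions = [1.5, 3.0, 6.0]  # hypothetical completion times C_i
weights = [2.0, 1.0, 1.0]
lp_value = induced_lp_value(completions, weights, t)
minsum = sum(w * c for w, c in zip(weights, completions))
```

Each completion time is rounded down to the start of its interval, which is why the induced objective can only underestimate the minsum.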

4. EXPERIMENTS 
4.1 Experimental setting 

The experimental simulations presented here were
performed with an ad-hoc program. Each experiment consists
of 40 runs; for each run, tasks are generated in an
off-line manner, then given as an input to the scheduling
algorithm and to the linear solver which computes a lower
bound for this instance. Comparison between the two
results yields a performance ratio, and the average ratio for
the whole set of runs is the result of the experiment.

The runs were made assuming a cluster of 200 processors,
and a number of tasks varying from 25 to 400. In order
to describe a mono-processor task, only its computing time
is needed. A moldable task is described by a vector of $m$
processing times (one per number of processors allotted to the
task). We used two different models to generate the tasks.
The first one generates the sequential processing times of
the tasks, and the second one uses a parallelism model to
derive all the other values.

Two different sequential workload types were used:
uniform and mixed cases. For all uniform cases, sequential
times were generated according to a uniform distribution,
varying from 1 to 10. For mixed cases, we introduce two
classes: small and large tasks. The random values are
taken from Gaussian distributions centered respectively on
1 and 10, with respective standard deviations of 0.5 and 5,
the ratio of small tasks being 70%.
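
These two sequential workload models can be sketched with Python's standard `random` module. The function name is hypothetical, and redrawing non-positive Gaussian samples is an assumption of this example: the text does not say how such values were handled for sequential times.

```python
import random


def sequential_times(n, kind, rng):
    """Generate n sequential processing times.

    kind == "uniform": uniform distribution between 1 and 10.
    kind == "mixed"  : 70% small tasks (Gaussian, mean 1, std 0.5) and
                       30% large tasks (Gaussian, mean 10, std 5).
    """
    times = []
    for _ in range(n):
        if kind == "uniform":
            times.append(rng.uniform(1, 10))
        else:
            mu, sigma = (1, 0.5) if rng.random() < 0.7 else (10, 5)
            x = rng.gauss(mu, sigma)
            while x <= 0:  # assumption: redraw non-positive durations
                x = rng.gauss(mu, sigma)
            times.append(x)
    return times


rng = random.Random(0)  # fixed seed so that runs are reproducible
uniform_times = sequential_times(1000, "uniform", rng)
mixed_times = sequential_times(1000, "mixed", rng)
```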



Modeling the parallelism of the jobs was done in two
different ways. In the first, successive processing times were
computed with a recurrence deriving $p_i(j)$ from $p_i(j-1)$ and
a random variable $X$ between 0 and 1. Depending on the
distribution of $X$, the tasks generated are highly parallel (with
a quasi-linear speedup) or weakly parallel (with a speedup
close to 1). Highly and weakly parallel tasks are respectively
generated using Gaussian distributions centered on 0.9 and 0.1,
with a standard deviation of 0.2. Any random value
smaller than 0 or larger than 1 is ignored and
recomputed. In accordance with the usual behavior of parallel programs,
this method generates monotonic tasks, which have
decreasing execution times and increasing work with $k$. For the
mixed cases, the small tasks are weakly parallel and the
large tasks are highly parallel.

The second way of modeling parallelism follows
a model from Cirne and Berman [5], which relies on a
survey about the behavior of the users in a computing
center. Only the uniform(1, 10) sequential time model is used
for these tasks.

To evaluate our algorithm, we use the lower bound (cf.
Section 3.3) as a reference. Some simple "standard" algorithms
are used to compare the behavior and efficiency of our
approach.

Gang: Each task is scheduled on all processors. The tasks
are sorted using the ratio of the weight over the
execution time. This algorithm is optimal for instances
with linear speedup.

Sequential: Each task is scheduled on a single processor.
A list algorithm is used, scheduling large processing
time first (LPTF).

List Graham: The three algorithms are multiprocessor list
scheduling algorithms. Every task is allotted the
number of processors selected by [7]. This should lead to
a very good average performance ratio with respect
to the $C_{\max}$ criterion. Only the order of the list
changes between the three algorithms:

• the first one keeps the order of [7], listing first the tasks
of the large shelf, then the tasks of the small shelf, then
the small tasks;

• weighted largest processing time first (LPTF), a
classical variant with a very good behavior for the
$C_{\max}$ criterion, but the tasks are in fact sorted
using the ratio between their weight and their
execution time;

• smallest area first (SAF), almost the opposite of
LPTF: the tasks are sorted according to their area
(number of processors x execution time). The
goal is to improve the average performance ratio
for the $\sum w_i C_i$ criterion.
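
The orderings used by two of these reference algorithms can be sketched as simple sort keys. The task representation and function names are hypothetical, and sorting Gang by decreasing ratio is an assumption (the text only names the ratio), chosen because it is the order that favors heavy, short tasks.

```python
def gang_order(tasks, m):
    """Order for Gang: decreasing weight / execution time on all m processors.

    tasks[i] is a dict with a weight "w" and a list "p" where p[k-1] is the
    processing time on k processors (hypothetical representation).
    """
    return sorted(range(len(tasks)),
                  key=lambda i: tasks[i]["w"] / tasks[i]["p"][m - 1],
                  reverse=True)


def saf_order(tasks, allot):
    """Order for SAF: increasing area = processors x execution time."""
    return sorted(range(len(tasks)),
                  key=lambda i: allot[i] * tasks[i]["p"][allot[i] - 1])


tasks = [
    {"w": 1, "p": [4.0, 2.0]},
    {"w": 3, "p": [6.0, 3.0]},
    {"w": 2, "p": [2.0, 1.0]},
]
gang = gang_order(tasks, m=2)
saf = saf_order(tasks, allot=[2, 1, 1])
```

The allotment passed to `saf_order` is supplied by hand here; in the experiments it comes from the makespan algorithm.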

In all experiments, task priority is a random value taken
from a uniform distribution between 1 and 10.

4.2 Simulation results 

The results of the simulation runs are given in the
following figures, plotting the minimum, maximum and
average values for $C_{\max}$ and $\sum w_i C_i$. The average of the
competitive ratio is computed by dividing the sum of the
execution times by the sum of the lower bounds for every point
[15]. Each workload type is represented separately. The
same scale is used for a given criterion across
workload types.
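
This way of averaging is a ratio of sums rather than a mean of per-run ratios, and the two can differ noticeably. The sketch below uses two hypothetical runs to show the difference; the function name is an assumption.

```python
def average_ratio(values, lower_bounds):
    """Ratio of sums: total criterion value over total lower bound."""
    return sum(values) / sum(lower_bounds)


values = [30.0, 10.0]        # hypothetical criterion values for two runs
lower_bounds = [20.0, 5.0]   # corresponding lower bounds
ratio_of_sums = average_ratio(values, lower_bounds)
mean_of_ratios = sum(v / lb for v, lb in zip(values, lower_bounds)) / len(values)
```

The ratio of sums gives more importance to the larger instances, whereas the mean of ratios treats every run equally.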

The tasks of Figure 3 are weakly parallel. This is the
worst case for our algorithm, as it spends resources to
accelerate the completion of small and high priority parallel tasks.
These resources are thus spent without much gain. Note
that Gang scheduling does not appear in the presented range
for $C_{\max}$, as Gang always has a very large ratio in this case.

As expected, the average performance ratio for our
algorithm is worse than that of all other algorithms except Gang.
Nevertheless, the performance ratio for $C_{\max}$ is no more than
2. All other algorithms have an average performance ratio
around 1.5. The difference is large enough to also influence
the results for the minsum criterion. From this case we may
deduce that for most cases, our algorithm will not be much
worse than a performance ratio of 2 for both criteria.

Figure 3: Performance ratio for the simulation on
200 processors, weakly parallel tasks

Figure 4 presents the same experiments with the highly
parallel tasks. On the minsum criterion, our algorithm is
clearly the best one. Gang and Sequential have opposite
behaviors on both criteria, Gang being good with a small
number of tasks and Sequential good for a large number of
tasks only. The other algorithms are stable (with respect to
the number of tasks) but with a larger ratio on the minsum.
Note that the allotment computed for list algorithms is
quite good, as the $C_{\max}$ performance ratio of these algorithms
is always smaller than 2.

Figure 4: Performance ratio for the simulation on
200 processors, highly parallel tasks

Figure 5: Performance ratio for the simulation on
200 processors, mixed model parallel tasks

The next experiment (cf. Figure 5) presents mixed
instances with some large tasks and plenty of small tasks. In
this case our algorithm is still quite stable, with a
performance ratio of around 2 for both criteria; however, SAF is
better than our algorithm. The ratio of the two other list
algorithms greatly increases with the number of tasks, which
points out that the order of tasks is very important here.

Finally, the last experiment uses a well-known workload
generator which emulates real applications [5]. In this more
realistic setting, our algorithm clearly outperforms the other
ones for the minsum criterion, and is also the only one to
keep a stable ratio for any number of tasks.

Several observations can be made from these results. First,
the performance ratio for the minsum criterion is never more
than 2.5, and is on average around 2. The performance ratio
for the makespan is almost always below 2, and is 1.9 on
average. This is very good, even for each criterion separately.

Figure 6: Performance ratio for the simulation on
200 processors, Cirne model parallel tasks

The second observation is that our algorithm performs
better when tasks are more parallel. This can be understood
if we remark that, for a weakly parallel task, there are only
one or two intervals in which it can be scheduled without
degrading its performance. So the scheduling algorithm is
more constrained when the tasks are not parallel.

The SAF algorithm performs quite well on simple cases. On
complex cases, it appears that our approach is required to
keep a good performance on the minsum criterion. Thus our
algorithm should be preferred in actual applications, as its
performance ratio for minsum is insensitive to job behavior
and its performance ratio for the makespan is not far from
the alternatives.

Finally, Figure 7 shows that the execution time of our
scheduling algorithm is low (less than 2 seconds for the
largest instances), as expected.

Figure 7: Execution time of the algorithm.



5. CONCLUDING REMARKS 

In this paper we presented a new algorithm for scheduling
a set of independent jobs on a cluster. The main feature
is to optimize two criteria simultaneously. The experiments
show that on average the performance ratio is very good, and
the algorithm is fast enough for practical use. The algorithm
has been assessed by comparing the minsum performance to
a new lower bound based on the relaxation of an ILP, and
comparing the makespan performance to the best known
approximation. Results under real conditions are not available
at the moment, but we are currently implementing this algorithm on
a full-scale platform (Icluster2).

Several technical problems still have to be solved for an 
even more efficient practical solution, namely the reserva- 
tion of nodes which reduces the size of the cluster and the 
mix of different types of jobs (moldable jobs, rigid jobs, and 
divisible load jobs). 